Skip to content

Retry OAuth token refresh on infrastructure 4xx#5170

Open
gkatz2 wants to merge 1 commit intostacklok:mainfrom
gkatz2:fix/retry-4xx-token-refresh-5169
Open

Retry OAuth token refresh on infrastructure 4xx#5170
gkatz2 wants to merge 1 commit intostacklok:mainfrom
gkatz2:fix/retry-4xx-token-refresh-5169

Conversation

@gkatz2
Copy link
Copy Markdown
Contributor

@gkatz2 gkatz2 commented May 3, 2026

Summary

  • The OAuth token refresh endpoint returns 4xx for infrastructure events (WAF/firewall blocks, rate limits, transient bad-config deploys) as well as for OAuth protocol failures. Today every 4xx is treated as a permanent auth failure, killing the workload until manual re-authentication. A momentary VPN flap that routes a refresh request through a non-allowlisted IP is enough to leave the workload dead.
  • Treat 4xx responses that lack a structured RFC 6749 error code as transient, so they enter the existing retry loop added in Retry OAuth token refresh on server errors #4513. Only 4xx with a populated error code (invalid_grant, invalid_client, etc.) remains permanent. 429 is always transient per HTTP standard.

Fixes #5169

Type of change

  • Bug fix

Test plan

  • Unit tests (task test)
  • Linting (task lint-fix)
  • Manual testing (describe below)

Added TestIsTransientNetworkError_AgainstRealOAuth2Library, which exercises the classification function end-to-end through golang.org/x/oauth2.Config.TokenSource against an httptest.Server returning each response shape (HTML 403, HTML 401, JSON invalid_grant, JSON invalid_client, 429, 503). This pins the assumption that RetrieveError.ErrorCode is populated only for parseable RFC 6749 error responses, so a future oauth2 upgrade that changes that behavior would surface here, not in production.

API Compatibility

  • This PR does not break the v1beta1 API, OR the api-break-allowed label is applied and the migration guidance is described above.

Does this introduce a user-facing change?

Remote MCP servers with OAuth authentication now survive transient 4xx token-endpoint events (WAF blocks, rate limits, brief upstream config errors) instead of permanently going dark on the first occurrence.

Special notes for reviewers

Classification rule

golang.org/x/oauth2 populates RetrieveError.ErrorCode only when the response body is parseable JSON containing an RFC 6749 error field. An empty ErrorCode therefore signals an infrastructure-level response (HTML page from a WAF, CDN, or reverse proxy), not an OAuth protocol failure. The new classifyOAuthRetrieveError helper applies this rule:

  • 5xx: transient (existing behavior, from Retry OAuth token refresh on server errors #4513)
  • 429: transient regardless of body (HTTP standard)
  • 4xx with empty ErrorCode: transient (infrastructure error)
  • 4xx with populated ErrorCode: permanent (OAuth protocol failure)

Scope

Classification only — no changes to retry parameters, monitor lifecycle, or workload state machine. Outages that span longer than a single refresh attempt (e.g. a multi-hour VPN outage that crosses a token expiry boundary) still result in the workload being marked unauthenticated; that's a separable architectural change requiring a new "transiently failing" state and is not part of this fix.

Interaction with #5044

PR #5044 (in flight) emits a DCR remediation hint when classification returns permanent 4xx. This change reduces false triggers of that hint: the hint will fire for invalid_grant/invalid_client (where DCR creds genuinely may be stale) but no longer for WAF blocks (where the hint would have been misleading). Pure mechanical merge conflict on the same file, no semantic conflict.

Generated with Claude Code

@github-actions github-actions Bot added the size/S Small PR: 100-299 lines changed label May 3, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented May 3, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 67.82%. Comparing base (9572f6a) to head (10b8e5e).
⚠️ Report is 18 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5170      +/-   ##
==========================================
+ Coverage   67.65%   67.82%   +0.16%     
==========================================
  Files         607      610       +3     
  Lines       61982    62407     +425     
==========================================
+ Hits        41937    42325     +388     
- Misses      16883    16910      +27     
- Partials     3162     3172      +10     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

jhrozek
jhrozek previously approved these changes May 5, 2026
Copy link
Copy Markdown
Contributor

@jhrozek jhrozek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Classification logic is RFC 6749 / 6585 compliant, the real-library contract test is the right way to defend against future golang.org/x/oauth2 upgrades changing ErrorCode semantics, and singleflight already prevents the cross-replica refresh-rotation race from getting worse. PR scope is tight (classification only, ~206 LOC).

Two non-blocking comments below — both are suggestions for follow-up scope and a small test-design refinement, not anything that should hold this up.

return true
}

if statusCode == http.StatusTooManyRequests {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

429 is now transient, but Retry-After is ignored — the existing 10s → 2m exponential backoff applies regardless of what the server suggests. Real providers (GitHub, Google) routinely set the header, so a Retry-After: 600 will get hit ~5–30 times before the 5-minute cap stops things, instead of waiting the requested cooldown. RFC 6585 §4 says it's MAY, so ignoring it is technically compliant.

Honoring it cleanly would mean threading a "min wait" hint from the classifier into the retry loop, which is a bigger change than this PR's classification-only scope. Would you want to file a follow-up issue (or follow-up PR) for it?

// RetrieveError.ErrorCode for a given response shape. This test pins that
// assumption so a future oauth2 upgrade that changes ErrorCode population
// would surface here, not in production.
func TestIsTransientNetworkError_AgainstRealOAuth2Library(t *testing.T) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All six subtests here mirror cases already covered by the synthetic table in TestMonitoredTokenSource_BackgroundMonitor_ErrorClassification (400 invalid_grant, 429 empty body, 503, HTML 401/403, etc.).

The doc comment frames the real value well — pinning the golang.org/x/oauth2 ErrorCode-population contract so a future library upgrade that changes that behavior fails here, not in production. That's worth keeping. But as written it reads like overlapping unit coverage, which .claude/rules/testing.md flags ("Avoid overlapping test cases exercising the same code path" / "Tests must only test code in the package under test").

Two ways to make the intent visible:

  1. Trim to just the cases where the synthetic helpers could diverge from real library output — i.e. the HTML/WAF shapes (where the empty-ErrorCode premise is non-obvious), plus an edge case like a form-encoded error body. Drop the JSON invalid_grant/invalid_client cases that the synthetic table already pins.
  2. Move to a contract_test.go with a header comment noting its role is dependency pinning, not unit coverage of the package's own logic.

Either works — preference?

Copy link
Copy Markdown
Contributor Author

@gkatz2 gkatz2 May 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went with option (1) in the latest amend. I trimmed to 3 cases – kept HTML 401 and HTML 403 (WAF), dropped the JSON and 5xx/429 cases (synthetic table covers them), and added a 400-with-form-encoded-invalid_grant case where the synthetic helper would lie (golang.org/x/oauth2 populates ErrorCode from form-encoded bodies – internal/token.go:288 in v0.34.0). I tightened the doc comment to frame the test as dependency-pinning, not unit coverage.

The OAuth token refresh endpoint returns 4xx for infrastructure
events (WAF/firewall blocks, rate limits, transient bad-config
deploys) as well as for OAuth protocol failures. Today every 4xx
is treated as a permanent auth failure, killing the workload
until manual re-authentication. A momentary VPN flap that routes
a refresh request through a non-allowlisted IP is enough to
leave the workload dead.

Treat 4xx responses that lack a structured RFC 6749 error code
as transient; only 4xx with a populated error code
(invalid_grant, invalid_client, etc.) remains permanent. 429 is
always transient per HTTP standard.

Fixes stacklok#5169

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Greg Katz <gkatz@indeed.com>
@gkatz2 gkatz2 force-pushed the fix/retry-4xx-token-refresh-5169 branch from 9de8e4f to 10b8e5e Compare May 7, 2026 18:06
@gkatz2
Copy link
Copy Markdown
Contributor Author

gkatz2 commented May 7, 2026

I force-pushed an amended version with the rebase plus a few additional refinements. Headline:

  1. Rebased onto current main (clean conflict resolution against Wire authserver DCR resolver and add structured logs #5044 and Set User-Agent on OAuth token refresh requests #5168).

  2. Tightened isPermanentTokenEndpointError to be the strict inverse of isTransientRetrieveError on the *oauth2.RetrieveError branch. The DCR remediation Warn now fires only when the response carries a populated RFC 6749 error code — not on 4xx-with-empty-ErrorCode (HTML pages from a WAF, CDN, or reverse proxy). The reasoning: a non-spec-compliant response means the OAuth server didn't render a verdict, so recommending "delete cached credentials" based on it would frequently mislead operators whose real problem is upstream of the OAuth server. This is a behavior tweak to code originally added in Wire authserver DCR resolver and add structured logs #5044 — flagging this for @tgrunnagle in case you want to weigh in.

  3. Renamed classifyOAuthRetrieveErrorisTransientRetrieveError for naming symmetry with isPermanentTokenEndpointError and to match the file's is<Adjective><Noun>Error convention.

  4. Added TestIsPermanentTokenEndpointError to assert the new inverse behavior directly. The function had 100% line coverage before, but no test asserted on its return value, so the semantic change wouldn't have failed any existing test.

  5. Trimmed TestIsTransientNetworkError_AgainstRealOAuth2Library per your comment.

Please re-review when you have a chance, @jhrozek. The diff between the previous approval state and current HEAD is centered on pkg/auth/monitored_token_source.go around isPermanentTokenEndpointError / isTransientRetrieveError.

@github-actions github-actions Bot added size/S Small PR: 100-299 lines changed and removed size/S Small PR: 100-299 lines changed labels May 7, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/S Small PR: 100-299 lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Token refresh treats transient 4xx as permanent auth failures

2 participants